Introduction

This IPython notebook illustrates how to refine the results of matching using triggers.

First, we need to import the py_entitymatching package and other libraries as follows:


In [2]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the (sample) input tables for matching purposes.


In [3]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'

In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()


Out[5]:
_id ltable_id rtable_id ltable_title ltable_authors ltable_year rtable_title rtable_authors rtable_year label
0 0 l1223 r498 Dynamic Information Visualization Yannis E. Ioannidis 1996 Dynamic information visualization Yannis E. Ioannidis 1996 1
1 1 l1563 r1285 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 1
2 2 l1514 r1348 Query Processing and Optimization in Oracle Rdb Gennady Antoshenkov, Mohamed Ziauddin 1996 prospector: a content-based multimedia server for massively parallel architectures S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader 1996 0
3 3 l206 r1641 An Asymptotically Optimal Multiversion B-Tree Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger 1996 A complete temporal relational algebra Debabrata Dey, Terence M. Barron, Veda C. Storey 1996 0
4 4 l1589 r495 Evaluating Probabilistic Queries over Imprecise Data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 Evaluating probabilistic queries over imprecise data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 1
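
py_entitymatching stores metadata alongside each table (the key of S and the foreign keys pointing into A and B). As an optional sanity check, we can display the metadata to confirm it was picked up correctly:


In [ ]:
# Optional sanity check: display the metadata stored for the labeled table
em.show_properties(S)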

Use an ML Matcher to Get Predictions

Here we will purposely create a decision tree matcher that does not take several useful features into account, so that we can later show how triggers can be used to refine the model.


In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']

In [7]:
# Create a Decision Tree Matcher
dt = em.DTMatcher(name='DecisionTree', random_state=0)

In [8]:
# Generate a set of features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
feature_table


Out[8]:
feature_name left_attribute right_attribute left_attr_tokenizer right_attr_tokenizer simfunction function function_source is_auto_generated
0 id_id_lev_dist id id None None lev_dist <function id_id_lev_dist at 0x11b874aa0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
1 id_id_lev_sim id id None None lev_sim <function id_id_lev_sim at 0x11b874d70> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
2 id_id_jar id id None None jaro <function id_id_jar at 0x11b874a28> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
3 id_id_jwn id id None None jaro_winkler <function id_id_jwn at 0x11b874c80> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
4 id_id_exm id id None None exact_match <function id_id_exm at 0x11b874de8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
5 id_id_jac_qgm_3_qgm_3 id id qgm_3 qgm_3 jaccard <function id_id_jac_qgm_3_qgm_3 at 0x11b874e60> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
6 title_title_jac_qgm_3_qgm_3 title title qgm_3 qgm_3 jaccard <function title_title_jac_qgm_3_qgm_3 at 0x11b889050> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
7 title_title_cos_dlm_dc0_dlm_dc0 title title dlm_dc0 dlm_dc0 cosine <function title_title_cos_dlm_dc0_dlm_dc0 at 0x11b8890c8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
8 title_title_mel title title None None monge_elkan <function title_title_mel at 0x11b889140> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
9 title_title_lev_dist title title None None lev_dist <function title_title_lev_dist at 0x11b8891b8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
10 title_title_lev_sim title title None None lev_sim <function title_title_lev_sim at 0x11b889230> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
11 authors_authors_jac_qgm_3_qgm_3 authors authors qgm_3 qgm_3 jaccard <function authors_authors_jac_qgm_3_qgm_3 at 0x11b8892a8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
12 authors_authors_cos_dlm_dc0_dlm_dc0 authors authors dlm_dc0 dlm_dc0 cosine <function authors_authors_cos_dlm_dc0_dlm_dc0 at 0x11b889320> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
13 authors_authors_mel authors authors None None monge_elkan <function authors_authors_mel at 0x11b889398> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
14 authors_authors_lev_dist authors authors None None lev_dist <function authors_authors_lev_dist at 0x11b889410> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
15 authors_authors_lev_sim authors authors None None lev_sim <function authors_authors_lev_sim at 0x11b889488> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
16 year_year_exm year year None None exact_match <function year_year_exm at 0x11b889500> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
17 year_year_anm year year None None abs_norm <function year_year_anm at 0x11b889578> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
18 year_year_lev_dist year year None None lev_dist <function year_year_lev_dist at 0x11b8895f0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
19 year_year_lev_sim year year None None lev_sim <function year_year_lev_sim at 0x11b889668> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True

In [9]:
# We will remove many of the features here to purposely create a poor model. This will make it
# easier to demonstrate triggers later
F = feature_table.drop([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
F


Out[9]:
feature_name left_attribute right_attribute left_attr_tokenizer right_attr_tokenizer simfunction function function_source is_auto_generated
0 id_id_lev_dist id id None None lev_dist <function id_id_lev_dist at 0x11b874aa0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
15 authors_authors_lev_sim authors authors None None lev_sim <function authors_authors_lev_sim at 0x11b889488> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
16 year_year_exm year year None None exact_match <function year_year_exm at 0x11b889500> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
17 year_year_anm year year None None abs_norm <function year_year_anm at 0x11b889578> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
18 year_year_lev_dist year year None None lev_dist <function year_year_lev_dist at 0x11b8895f0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
19 year_year_lev_sim year year None None lev_sim <function year_year_lev_sim at 0x11b889668> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
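
Dropping rows by positional label works here because the feature table still has its default integer index, but selecting the features to keep by name is a little more robust. A sketch of an equivalent, name-based alternative:


In [ ]:
# Equivalent, name-based way to keep only the desired features
keep = ['id_id_lev_dist', 'authors_authors_lev_sim', 'year_year_exm',
        'year_year_anm', 'year_year_lev_dist', 'year_year_lev_sim']
F = feature_table[feature_table['feature_name'].isin(keep)]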

In [10]:
# Convert I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)
H.head()


Out[10]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label
430 430 l1494 r1257 4 0.083333 1 1.0 0.0 1.0 0
35 35 l1385 r1160 4 0.271186 1 1.0 0.0 1.0 0
394 394 l1345 r85 4 0.338462 1 1.0 0.0 1.0 1
29 29 l611 r141 3 0.277778 1 1.0 0.0 1.0 0
181 181 l1164 r1161 2 0.244444 1 1.0 0.0 1.0 1
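
Feature vectors can contain missing values (for example, when an attribute value is missing in A or B), which the scikit-learn based matchers cannot handle. A quick pandas check before training:


In [ ]:
# Check whether any feature vector contains missing values
H.isnull().values.any()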

In [11]:
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                strategy='mean')

In [12]:
# Fit the decision tree to the feature vectors
dt.fit(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], target_attr='label')

In [13]:
# Use the decision tree matcher to predict if tuple pairs match
dt.predict(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], target_attr='predicted_labels', 
           return_probs=True, probs_attr='proba', append=True, inplace=True)
H.head()


Out[13]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
430 430 l1494 r1257 4.0 0.083333 1.0 1.0 0.0 1.0 0 0 0.0
35 35 l1385 r1160 4.0 0.271186 1.0 1.0 0.0 1.0 0 0 0.0
394 394 l1345 r85 4.0 0.338462 1.0 1.0 0.0 1.0 1 1 1.0
29 29 l611 r141 3.0 0.277778 1.0 1.0 0.0 1.0 0 0 0.0
181 181 l1164 r1161 2.0 0.244444 1.0 1.0 0.0 1.0 1 1 1.0
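
Before debugging, it is worth quantifying how well this deliberately weakened matcher performs. A quick sketch using py_entitymatching's evaluation utilities (note that evaluating on the table the matcher was trained on gives optimistic numbers):


In [ ]:
# Summarize precision, recall, and F1 of the predictions against the gold labels
eval_result = em.eval_matches(H, 'label', 'predicted_labels')
em.print_eval_summary(eval_result)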

Debug the ML Matcher

Now we will use the debugger to determine what problems exist with our decision tree matcher.


In [14]:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']

In [15]:
# Debug the decision tree matcher using the GUI
em.vis_debug_dt(dt, P, Q, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        target_attr='label')

# We see with the debugger that most of the false negatives are pairs whose titles are very similar
# but whose author strings differ (e.g., the same names listed in a different order). The matcher
# misses them, most likely because we removed all of the features that compare the Title attribute
# from each table earlier.

In [16]:
# We can see which tuple pairs are not predicted correctly
H[H['label'] != H['predicted_labels']]


Out[16]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
371 371 l650 r1594 4.0 0.120000 1.0 1.0 0.0 1.0 1 0 0.500000
259 259 l938 r1090 5.0 0.200000 1.0 1.0 0.0 1.0 1 0 0.333333
346 346 l1681 r693 4.0 0.238095 1.0 1.0 0.0 1.0 1 0 0.500000
184 184 l891 r485 4.0 0.137931 1.0 1.0 0.0 1.0 1 0 0.500000
11 11 l1189 r1674 4.0 0.222222 1.0 1.0 0.0 1.0 1 0 0.250000
121 121 l169 r521 4.0 0.153846 1.0 1.0 0.0 1.0 1 0 0.500000
267 267 l120 r1181 4.0 0.216667 1.0 1.0 0.0 1.0 1 0 0.500000
147 147 l867 r1263 4.0 0.142857 1.0 1.0 0.0 1.0 1 0 0.333333
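
To connect these rows back to what the debugger showed, we can pull the original titles of the mispredicted pairs out of S (a quick sketch):


In [ ]:
# Look up the original titles of the mispredicted tuple pairs in S
bad_ids = H[H['label'] != H['predicted_labels']]['_id']
S[S['_id'].isin(bad_ids)][['_id', 'ltable_title', 'rtable_title']]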

Using Triggers to Improve Results

This typically involves the following steps:

  1. Creating the Match Trigger
  2. Adding Rules
  3. Adding a Condition Status and Action
  4. Using the Trigger to Improve Results

Creating the Match Trigger


In [17]:
# Use the constructor to create a trigger
mt = em.MatchTrigger()

Adding Rules

Before we can use the trigger, we need to create rules to evaluate tuple pairs. Each rule is a list of strings; taken together, the strings in a rule specify a conjunction of predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value.
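
For instance, the expression in the predicate title_title_lev_sim(ltuple, rtuple) > 0.7 (used below) can be evaluated by hand: look the feature up in the feature table and apply its function to a pair of tuples. A sketch (the rows of A and B chosen here are arbitrary):


In [ ]:
# Evaluate a predicate's expression by hand on an arbitrary tuple pair
f = feature_table[feature_table['feature_name'] == 'title_title_lev_sim']
fn = f.iloc[0]['function']
fn(A.iloc[0], B.iloc[0])  # the numeric value the rule compares against 0.7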


In [18]:
# Add two rules to the trigger

# Since we removed all of the features comparing the Title attribute earlier, we now add a rule that compares titles
mt.add_cond_rule(['title_title_lev_sim(ltuple, rtuple) > 0.7'], feature_table)
# The rule has two predicates, one comparing the titles and the other looking for an exact match of the years
mt.add_cond_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], feature_table)
mt.get_rule_names()


Out[18]:
['_rule_0', '_rule_1']

In [19]:
# Rules can also be deleted from the trigger
mt.delete_rule('_rule_1')


Out[19]:
True

Adding a Condition Status and Action

Next, we need to add a condition status and an action to the trigger. The trigger applies the added rules to each tuple pair; if a rule's result matches the condition status, the action is carried out.


In [20]:
# Since we are using the trigger to fix a problem related to false negatives, we want the condition to be 
# True and the action to be 1. This way, the trigger will set a prediction to 1 when the rule returns True.

mt.add_cond_status(True)
mt.add_action(1)


Out[20]:
True

Using the Trigger to Improve Results

Now that we have added rules, a condition status, and an action, we can execute the trigger to improve the results.


In [21]:
preds = mt.execute(input_table=H, label_column='predicted_labels', inplace=False)
preds.head()


Out[21]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
430 430 l1494 r1257 4.0 0.083333 1.0 1.0 0.0 1.0 0 0 0.0
35 35 l1385 r1160 4.0 0.271186 1.0 1.0 0.0 1.0 0 0 0.0
394 394 l1345 r85 4.0 0.338462 1.0 1.0 0.0 1.0 1 1 1.0
29 29 l611 r141 3.0 0.277778 1.0 1.0 0.0 1.0 0 0 0.0
181 181 l1164 r1161 2.0 0.244444 1.0 1.0 0.0 1.0 1 1 1.0

In [22]:
# We were able to significantly reduce the number of incorrectly labeled tuple pairs
preds[preds['label'] != preds['predicted_labels']]


Out[22]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
11 11 l1189 r1674 4.0 0.222222 1.0 1.0 0.0 1.0 1 0 0.25
267 267 l120 r1181 4.0 0.216667 1.0 1.0 0.0 1.0 1 0 0.50
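
We can also quantify the improvement with the same evaluation utilities used earlier (a sketch; if the metadata did not carry over to preds, copy it first with em.copy_properties(H, preds)):


In [ ]:
# Summarize precision, recall, and F1 after applying the trigger
eval_result = em.eval_matches(preds, 'label', 'predicted_labels')
em.print_eval_summary(eval_result)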

In [23]:
# We can see that the two tuple pairs that are still labeled incorrectly have the title and authors
# swapped into the wrong columns in one of the tuples.
pd.concat([S[S['_id'] == 11], S[S['_id'] == 267]])


Out[23]:
_id ltable_id rtable_id ltable_title ltable_authors ltable_year rtable_title rtable_authors rtable_year label
11 11 l1189 r1674 Weimin Du, Xiangning Liu, Abdelsalam Helal Multiview Access Protocols for Large-Scale Replication 1998 Multiview access protocols for large-scale replication Xiangning Liu, Abdelsalam Helal, Weimin Du 1998 1
267 267 l120 r1181 w. Bruce kroft, James callan, erik w. Brown fast incrremental indexiing for fulltext informtion retreval 1994 Fast Incremental Indexing For Full-Text Information Retrieval Eric W. Brown, James P. Callan, W. Bruce Croft 1994 1
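
One way to catch such swapped columns (not pursued in this notebook) would be a custom cross-attribute feature that compares the authors column on one side against the title column on the other, created with em.get_feature_fn and em.add_feature; it could then back an additional trigger rule. A hedged sketch (the feature name and threshold are illustrative):


In [ ]:
# Hypothetical extension: a cross-attribute feature to catch swapped columns
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
f = em.get_feature_fn('jaccard(qgm_3(ltuple["authors"]), qgm_3(rtuple["title"]))', sim, tok)
em.add_feature(feature_table, 'lauthors_rtitle_jac_qgm_3', f)

# An additional rule using it might look like:
# mt.add_cond_rule(['lauthors_rtitle_jac_qgm_3(ltuple, rtuple) > 0.7'], feature_table)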
